In this project, I went over a basic Python data analysis pipeline from start to finish. In the second part of this notebook, I analyzed factors that correlated with the SAT scores, and did some exploratory visualization and analysis.
To explore a dataset and see what columns are related to the one cared about (SAT score in this case) is to compute correlations. This will tell which columns are closely related to the SAT score. We can do this via the corr method on Pandas DataFrames. The closer to 0 the correlation, the weaker the connection. The closer to 1, the stronger the positive correlation, and the closer to -1, the stronger the negative correlation:
import pandas as pd
full=pd.read_csv('full.csv')
full.corr()['sat_score']
This gives us quite a few insights that we'll need to explore:
Before we dive into exploring the data, we'll want to set the context, both for ourselves, and anyone else that reads our analysis. One good way to do this is with exploratory charts or maps. In this case, we'll map out the positions of the schools, which will help readers understand the problem we're exploring. In the below code, we:
import folium
from folium import plugins
import pandas as pd
full=pd.read_csv('full.csv')
schools_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
marker_cluster = folium.plugins.MarkerCluster().add_to(schools_map)
for name, row in full.iterrows():
folium.Marker(location = [row["lat"], row["lon"]], popup="{0}: {1}").add_to(marker_cluster)
schools_map.save("schools.html")
schools_map
This map is helpful, but it's hard to see where the most schools are in NYC. Instead, we'll make a heatmap:
schools_heatmap = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
schools_heatmap.add_child(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in full.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
Heatmaps are good for mapping out gradients, but we'll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area. We can compute SAT score by school district, then plot this out on a map. In the below code, we'll:
import numpy as np
district_data = full.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))
district_data
Store the district_data in csv format.
district_data.to_csv('district_data.csv')
We'll now we able to plot the average SAT score in each school district. In order to do this, we'll read in data in GeoJSON format to get the shapes of each district, then match each district shape with the SAT score using the school_dist column, then finally create the plot:
import json
import pandas as pd
def show_district_map(col):
full=pd.read_csv('full.csv')
district_data=pd.read_csv('district_data.csv')
geo_data =r'schoolDistricts.json'
geo_json_data=json.load(open(geo_data))
#Create a map:
districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
#Create a layer, shaded by col:
folium.Choropleth(
geo_json_data,
data=district_data,
columns=['school_dist', col],
key_on='feature.properties.SchoolDist',
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
).add_to(districts)
#Output the map to an .html file:
districts.save(outfile='districts_1.html')
return districts
Now that we've set the context by plotting out where the schools are, and SAT score by district, people viewing our analysis have a better idea of the context behind the dataset. Now that we've set the stage, we can move into exploring the angles we identified earlier, when we were finding correlations. The first angle to explore is the relationship between the number of students enrolled in a school and SAT score. We can explore this with a scatter plot that compares total enrollment across all schools to SAT scores across all schools.
%matplotlib inline
full.plot.scatter(x='total_enrollment', y='sat_score')
As you can see, there's a cluster at the bottom left with low total enrollment and low SAT scores. Other than this cluster, there appears to only be a slight positive correlation between SAT scores and total enrollment. Graphing out correlations can reveal unexpected patterns. We can explore this further by getting the names of the schools with low enrollment and low SAT scores:
full[(full["total_enrollment"] < 1000) & (full["sat_score"] < 1000)]["School Name"]
Some searching on Google shows that most of these schools are for students who are learning English, and are low enrollment as a result. This exploration showed us that it's not total enrollment that's correlated to SAT score -- it's whether or not students in the school are learning English as a second language or not.
Now that we know the percentage of English language learners in a school is correlated with lower SAT scores, we can explore the relationship. The ell_percent column is the percentage of students in each school who are learning English. We can make a scatterplot of this relationship:
full.plot.scatter(x='ell_percent', y='sat_score')
It looks like there are a group of schools with a high ell_percentage that also have low average SAT scores. We can investigate this at the district level, by figuring out the percentage of English language learners in each district, and seeing it if matches our map of SAT scores by district:
show_district_map("ell_percent")
show_district_map("sat_score")
As we can see by looking at the two district level maps, districts with a low proportion of ELL learners tend to have high SAT scores, and vice versa.
It would be fair to assume that the results of student, parent, and teacher surveys would have a large correlation with SAT scores. It makes sense that schools with high academic expectations, for instance, would tend to have higher SAT scores. To test this theory, lets plot out SAT scores and the various survey metrics:
full.corr()["sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()
Surprisingly, the two factors that correlate the most are N_p and N_s, which are the counts of parents and students who responded to the surveys. Both strongly correlate with total enrollment, so are likely biased by the ell_learners. The other metric that correlates most is saf_t_11. That is how safe students, parents, and teachers perceived the school to be. It makes sense that the safer the school, the more comfortable students feel learning in the environment. However, none of the other factors, like engagement, communication, and academic expectations, correlated with SAT scores.
One of the other angles to investigate involves race and SAT scores. There was a large correlation differential, and plotting it out will help us understand what's happening:
full.corr()["sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()
It looks like the higher percentages of white and asian students correlate with higher SAT scores, but higher percentages of black and hispanic students correlate with lower SAT scores. For hispanic students, this may be due to the fact that there are more recent immigrants who are ELL learners. We can map the hispanic percentage by district to eyeball the correlation:
show_district_map("hispanic_per")
show_district_map("ell_percent")
The final angle to explore is the relationship between gender and SAT score. We noted that a higher percentage of females in a school tends to correlate with higher SAT scores. We can visualize this with a bar graph:
full.corr()["sat_score"][["male_per", "female_per"]].plot.bar()
To dig more into the correlation, we can make a scatterplot of female_per and sat_score:
full.plot.scatter(x='female_per', y='sat_score')
It looks like there's a cluster of schools with a high percentage of females, and very high SAT scores (in the top right). We can get the names of the schools in this cluster:
full[(full["female_per"] > 65) & (full["sat_score"] > 1400)]["School Name"]
Searching Google reveals that these are elite schools that focus on the performing arts. These schools tend to have higher percentages of females, and higher SAT scores. This likely accounts for the correlation between higher female percentages and SAT scores, and the inverse correlation between higher male percentages and lower SAT scores.
So far, we've looked at demographic angles. One angle that we have the data to look at is the relationship between more students taking Advanced Placement exams and higher SAT scores. It makes sense that they would be correlated, since students who are high academic achievers tend to do better on the SAT.
full["ap_avg"] = full["AP Test Takers "] / full["total_enrollment"]
full.plot.scatter(x='ap_avg', y='sat_score')
It looks like there is indeed a strong correlation between the two. An interesting cluster of schools is the one at the top right, which has high SAT scores and a high proportion of students that take the AP exams:
full[(full["ap_avg"] > .3) & (full["sat_score"] > 1700)]["School Name"]
Some Google searching reveals that these are mostly highly selective schools where you need to take a test to get in. It makes sense that these schools would have high proportions of AP test takers.